Housing Price Prediction via Improved Machine Learning Techniques
A reproduction of the research by Quang Truong, Minh Nguyen, Hy Dang and Bo Mei
The purpose of the original article was to build a model predicting housing prices in Beijing and to determine the impact of various property characteristics on prices. For our reproduction of the article we used the R language and its libraries (the original analysis was carried out in Python).
Data
We read the data with “gbk” encoding (for Chinese characters) and inspect the column structure.
```{r}
#read csv as dataframe
housing_prices_data <- as.data.frame(read.csv("new.csv", fileEncoding = "gbk", header = TRUE)) # fileEncoding = "gbk" handles the Chinese characters
str(housing_prices_data)
```
'data.frame': 318851 obs. of 26 variables:
$ url : chr "https://bj.lianjia.com/chengjiao/101084782030.html" "https://bj.lianjia.com/chengjiao/101086012217.html" "https://bj.lianjia.com/chengjiao/101086041636.html" "https://bj.lianjia.com/chengjiao/101086406841.html" ...
$ id : chr "101084782030" "101086012217" "101086041636" "101086406841" ...
$ Lng : num 116 116 117 116 116 ...
$ Lat : num 40 39.9 39.9 40.1 39.9 ...
$ Cid : num 1.11e+12 1.11e+12 1.11e+12 1.11e+12 1.11e+12 ...
$ tradeTime : chr "2016-08-09" "2016-07-28" "2016-12-11" "2016-09-30" ...
$ DOM : num 1464 903 1271 965 927 ...
$ followers : int 106 126 48 138 286 57 167 138 218 134 ...
$ totalPrice : num 415 575 1030 298 392 ...
$ price : int 31680 43436 52021 22202 48396 52000 37672 49521 27917 55883 ...
$ square : num 131 132 198 134 81 ...
$ livingRoom : chr "2" "2" "3" "3" ...
$ drawingRoom : chr "1" "2" "2" "1" ...
$ kitchen : int 1 1 1 1 1 1 1 1 1 0 ...
$ bathRoom : chr "1" "2" "3" "1" ...
$ floor : chr "高 26" "高 22" "中 4" "底 21" ...
$ buildingType : num 1 1 4 1 4 4 4 1 3 1 ...
$ constructionTime : chr "2005" "2004" "2005" "2008" ...
$ renovationCondition: int 3 4 3 1 2 3 4 4 1 4 ...
$ buildingStructure : int 6 6 6 6 2 6 2 6 2 6 ...
$ ladderRatio : num 0.217 0.667 0.5 0.273 0.333 0.333 0.5 0.667 0.333 0.308 ...
$ elevator : num 1 1 1 1 0 1 0 1 0 1 ...
$ fiveYearsProperty : num 0 1 0 0 1 1 0 1 0 1 ...
$ subway : num 1 0 0 0 1 0 0 0 0 1 ...
$ district : int 7 7 7 6 1 7 7 7 13 1 ...
$ communityAverage : num 56021 71539 48160 51238 62588 ...
Data cleaning
In the original article, the attributes describing the number of kitchens, bathrooms and drawing rooms were removed due to their ambiguity. We decided to keep them and give them a chance in further analysis. Columns such as “id” and “url” were removed. As in the article, we removed the variable “DOM” (“Days on market”) because as many as 49.5% of the observations had missing values for it. Also following the article, we removed records containing any null values; the percentage of such cases was very low, so this decision did not significantly affect further work.
```{r}
#drop unnecessary columns
drop_cols <- c("url","id")
housing_prices_data <- housing_prices_data[ , !(names(housing_prices_data) %in% drop_cols)]
# finds the count of missing values as % of the whole dataset
colMeans(is.na(housing_prices_data))*100
housing_prices_data <- housing_prices_data[ , !(names(housing_prices_data) %in% "DOM")]
#Now we are going to check how many rows have missing values
obs_with_nulls <- housing_prices_data[!complete.cases(housing_prices_data),]
#Distributions of the full dataset and of the null-rows subset are similar. Also, there are only 2403 rows with null values.
#Then we can delete rows with NULLs.
housing_prices_data_clean <- na.omit(housing_prices_data)
```
Lng Lat Cid tradeTime
0.00000000 0.00000000 0.00000000 0.00000000
DOM followers totalPrice price
49.54571257 0.00000000 0.00000000 0.00000000
square livingRoom drawingRoom kitchen
0.00000000 0.00000000 0.00000000 0.00000000
bathRoom floor buildingType constructionTime
0.00000000 0.00000000 0.63383838 0.00000000
renovationCondition buildingStructure ladderRatio elevator
0.00000000 0.00000000 0.00000000 0.01003604
fiveYearsProperty subway district communityAverage
0.01003604 0.01003604 0.00000000 0.14520889
Data preprocessing
In preparing the data for modeling, we performed the necessary variable transformations. From the variable “floor” we extracted the floor number and, in contrast to the original article, discarded the floor type: the type describes the height just like the numeric value does, so it makes no sense to keep both. For the calculated Age variable, we replaced “Unknown” values with the mean of constructionTime. We also collapsed categories of variables such as “renovationCondition” to reduce the number of levels with similar or nearly identical business meaning. We created new variables not present in the original article, such as “avgRoomSize,” which describes the average size of the usable rooms in a given apartment. Finally, we capped outliers using the Inter-Quartile Range (IQR) method.
```{r}
summary(housing_prices_data_clean)
#converting some variables to number
housing_prices_data_clean$livingRoom <- as.integer(housing_prices_data_clean$livingRoom)
housing_prices_data_clean$drawingRoom <- as.integer(housing_prices_data_clean$drawingRoom)
housing_prices_data_clean$bathRoom <- as.integer(housing_prices_data_clean$bathRoom)
#Division of signs and numbers (floor type and height)
housing_prices_data_clean$floorType <- substring(housing_prices_data_clean$floor,1,2)
housing_prices_data_clean$floorNum <- as.integer(substring(housing_prices_data_clean$floor, 3)) #everything after the type prefix is the floor number
housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% "floor")]
# Floor type translation - the type is just a coarse grouping of floorNum, so it can be removed
# 高 - High
# 未知 - Unknown
# 顶 - Top
# 低 - Low
# 底 - Bottom
# 中 - Medium
#Calculation of distance between home and Beijing city center (Forbidden City coordinates in our case)
BeijingCenterLat <- 39.91690639140218
BeijingCenterLng <- 116.39716443298232
#Haversine distance to get result in kilometers
housing_prices_data_clean$distance <- distHaversine(p1=housing_prices_data_clean[,c("Lng","Lat")],c(BeijingCenterLng,BeijingCenterLat))/1000
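#Sanity-check sketch (our own illustration, not part of the original article):
#the Haversine formula d = 2R*asin(sqrt(sin^2(dLat/2) + cos(lat1)*cos(lat2)*sin^2(dLng/2)))
#in base R; results should be within ~0.2% of geosphere::distHaversine,
#which uses a slightly larger default Earth radius than the mean radius R = 6371 km assumed here
haversine_km <- function(lng1, lat1, lng2, lat2, R = 6371) {
  rad <- pi / 180
  dLat <- (lat2 - lat1) * rad
  dLng <- (lng2 - lng1) * rad
  a <- sin(dLat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dLng / 2)^2
  2 * R * asin(sqrt(a))
}
#e.g. the distance from the city center to itself is 0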
#Age -> construction time - current year
current_year <- as.numeric(format(Sys.Date(),"%Y"))
#Check frequency
table(housing_prices_data_clean$constructionTime)
#If the construction year is unknown ('未知'), impute with the mean construction year
meanConstrTime <- round(mean(as.integer(housing_prices_data_clean$constructionTime),na.rm = TRUE))
housing_prices_data_clean$age <- ifelse(housing_prices_data_clean$constructionTime=='未知',current_year-meanConstrTime,current_year - as.integer(housing_prices_data_clean$constructionTime))
housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% "constructionTime")]
##changing numeric to categories factors
housing_prices_data_clean$buildingType <- ifelse(housing_prices_data_clean$buildingType==1,"Tower",ifelse(housing_prices_data_clean$buildingType==2,"Bungalow",ifelse(housing_prices_data_clean$buildingType==3,"Plate and Tower","Plate")))
#Bungalow building types will be deleted from the dataset since they are outliers and bungalows differ completely from the other types
housing_prices_data_clean <- housing_prices_data_clean[housing_prices_data_clean$buildingType != "Bungalow", ]
housing_prices_data_clean$buildingType <- as.factor(housing_prices_data_clean$buildingType)
housing_prices_data_clean$renovationCondition <- ifelse(housing_prices_data_clean$renovationCondition==1,"Other",ifelse(housing_prices_data_clean$renovationCondition==2,"Rough",ifelse(housing_prices_data_clean$renovationCondition==3,"Simplicity","Hardcover")))
# The Rough renovation condition will be assigned to Hardcover
housing_prices_data_clean$renovationCondition <- as.factor(ifelse(housing_prices_data_clean$renovationCondition=="Rough","Hardcover",housing_prices_data_clean$renovationCondition))
housing_prices_data_clean$buildingStructure <- ifelse(housing_prices_data_clean$buildingStructure==1,"Unknown",ifelse(housing_prices_data_clean$buildingStructure==2,"Mixed",ifelse(housing_prices_data_clean$buildingStructure==3,"Brick and wood",ifelse(housing_prices_data_clean$buildingStructure==4,"Brick and concrete",ifelse(housing_prices_data_clean$buildingStructure==5,"Steel","Steel-concrete composite")))))
##Merging rare or ambiguous building structures into the larger categories
housing_prices_data_clean$buildingStructure <- as.factor(ifelse(housing_prices_data_clean$buildingStructure=="Steel","Steel-concrete composite",ifelse(housing_prices_data_clean$buildingStructure=="Brick and wood","Mixed",ifelse(housing_prices_data_clean$buildingStructure=="Unknown","Mixed",housing_prices_data_clean$buildingStructure))))
#housing_prices_data_clean$floorType <- as.factor(housing_prices_data_clean$floorType)
housing_prices_data_clean$elevator <- ifelse(housing_prices_data_clean$elevator==1,"Elevator","noElevator")
housing_prices_data_clean$elevator <- as.factor(housing_prices_data_clean$elevator)
housing_prices_data_clean$fiveYearsProperty <- ifelse(housing_prices_data_clean$fiveYearsProperty==1,"isFiveYProp","noFiveYProp")
housing_prices_data_clean$fiveYearsProperty <- as.factor(housing_prices_data_clean$fiveYearsProperty)
housing_prices_data_clean$subway <- ifelse(housing_prices_data_clean$subway==1,"Subway","NoSubway")
housing_prices_data_clean$subway <- as.factor(housing_prices_data_clean$subway)
housing_prices_data_clean$district <- as.factor(housing_prices_data_clean$district)
#char to Date
housing_prices_data_clean$tradeTime <- as.Date(housing_prices_data_clean$tradeTime)
##New feature - average size of a room
housing_prices_data_clean$avgRoomSize <- housing_prices_data_clean$square/(housing_prices_data_clean$livingRoom + housing_prices_data_clean$drawingRoom + housing_prices_data_clean$kitchen + housing_prices_data_clean$bathRoom)
housing_prices_data_clean$totalPrice <- housing_prices_data_clean$totalPrice * 10000 #real scale
#########
##INFO:
## totalPrice = price (average price per square meter) * square; originally stored in units of 10,000
## Currency is yuan
########
```
Lng Lat Cid tradeTime
Min. :116.1 Min. :39.63 Min. :1.111e+12 Length:316448
1st Qu.:116.3 1st Qu.:39.89 1st Qu.:1.111e+12 Class :character
Median :116.4 Median :39.93 Median :1.111e+12 Mode :character
Mean :116.4 Mean :39.95 Mean :1.125e+12
3rd Qu.:116.5 3rd Qu.:40.00 3rd Qu.:1.111e+12
Max. :116.7 Max. :40.25 Max. :1.185e+14
followers totalPrice price square
Min. : 0.00 Min. : 0.1 Min. : 1 Min. : 7.37
1st Qu.: 0.00 1st Qu.: 205.0 1st Qu.: 28090 1st Qu.: 57.90
Median : 5.00 Median : 294.0 Median : 38778 Median : 74.16
Mean : 16.71 Mean : 347.9 Mean : 43550 Mean : 82.84
3rd Qu.: 18.00 3rd Qu.: 425.0 3rd Qu.: 53857 3rd Qu.: 98.48
Max. :1143.00 Max. :4900.0 Max. :156250 Max. :640.00
livingRoom drawingRoom kitchen bathRoom
Length:316448 Length:316448 Min. :0.0000 Length:316448
Class :character Class :character 1st Qu.:1.0000 Class :character
Mode :character Mode :character Median :1.0000 Mode :character
Mean :0.9946
3rd Qu.:1.0000
Max. :3.0000
floor buildingType constructionTime renovationCondition
Length:316448 Min. :1.00 Length:316448 Min. :1.000
Class :character 1st Qu.:1.00 Class :character 1st Qu.:1.000
Mode :character Median :4.00 Mode :character Median :3.000
Mean :3.01 Mean :2.607
3rd Qu.:4.00 3rd Qu.:4.000
Max. :4.00 Max. :4.000
buildingStructure ladderRatio elevator fiveYearsProperty
Min. :1.000 Min. : 0 Min. :0.0000 Min. :0.0000
1st Qu.:2.000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:0.0000
Median :6.000 Median : 0 Median :1.0000 Median :1.0000
Mean :4.452 Mean : 64 Mean :0.5799 Mean :0.6467
3rd Qu.:6.000 3rd Qu.: 0 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :6.000 Max. :10009400 Max. :1.0000 Max. :1.0000
subway district communityAverage
Min. :0.0000 Min. : 1.000 Min. : 10847
1st Qu.:0.0000 1st Qu.: 6.000 1st Qu.: 46410
Median :1.0000 Median : 7.000 Median : 59025
Mean :0.6026 Mean : 6.767 Mean : 63784
3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.: 76001
Max. :1.0000 Max. :13.000 Max. :183109
1950 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963
11 5 13 61 61 77 66 124 49 175 8 21 89
1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976
103 190 93 73 4 4 280 8 32 167 178 382 437
1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
501 731 1749 3295 2080 2766 2879 3344 4693 4850 4923 5557 5482
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002
8368 4306 8472 6746 7713 9042 9042 6442 11281 10163 13809 10525 12287
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
19252 21003 18734 14739 14126 12095 11506 7213 5422 4982 2266 2079 445
2016 未知
82 18747
In addition to using the IQR method to detect outliers, we also analyzed the distributions of the variables.
```{r}
#density of totalPrice
ggplot(housing_prices_data_clean, aes(x = totalPrice)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
ggplot(housing_prices_data_clean, aes(x = followers)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
ggplot(housing_prices_data_clean, aes(x = floorNum)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
ggplot(housing_prices_data_clean, aes(x = communityAverage)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
ggplot(housing_prices_data_clean, aes(x = square)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
ggplot(housing_prices_data_clean, aes(x = price)) +
geom_histogram(aes(y = ..density..),
colour = 1, fill = "white") +
geom_density()
boxplot(housing_prices_data_clean$totalPrice)
#Looking for outliers in numerical variables using IQR
IQRcutData <- function(data, lower_quantile = 0.25, upper_quantile = 0.75)
{
  q <- quantile(data, probs = c(lower_quantile, upper_quantile), na.rm = TRUE)
  iqr <- q[2] - q[1]
  upper <- q[2] + 1.5 * iqr
  lower <- q[1] - 1.5 * iqr
  #winsorize: cap values outside the IQR fences
  data_cut <- data
  data_cut[data < lower] <- lower
  data_cut[data > upper] <- upper
  return(data_cut)
}
########################################################## Outliers IQR
cut_totalPrice <-IQRcutData(housing_prices_data_clean$totalPrice)
housing_prices_data_clean$totalPrice <- cut_totalPrice
cut_followers <- IQRcutData(housing_prices_data_clean$followers)
housing_prices_data_clean$followers <- cut_followers
cut_floorNum <- IQRcutData(housing_prices_data_clean$floorNum)
housing_prices_data_clean$floorNum <- cut_floorNum
cut_communityAverage <- IQRcutData(housing_prices_data_clean$communityAverage)
housing_prices_data_clean$communityAverage <- cut_communityAverage
cut_square <- IQRcutData(housing_prices_data_clean$square)
housing_prices_data_clean$square <- cut_square
cut_price <- IQRcutData(housing_prices_data_clean$price)
housing_prices_data_clean$price <- cut_price
```
Data Analysis
In the first step of the data analysis we decided to perform correlation analysis. In the article, correlation analysis was carried out only for selected variables; we approached the issue holistically.
```{r}
#Correlation matrix for numerics
corrData <- housing_prices_data_clean
drop_colsCorr2 <- c("tradeTime","Cid","floorType")
drop_colsCorr <- c("Cid","Lng","Lat","tradeTime","buildingType","renovationCondition","buildingStructure","floorType","elevator","fiveYearsProperty","subway","district")
corrData <- corrData[ , !(names(corrData) %in% drop_colsCorr)]
res <- cor(corrData)
#corrplot(res,method="number")
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(res, method="color", col=col(200),
type="upper", order="hclust",
addCoef.col = "black", # Add coefficient of correlation
tl.col="black", tl.srt=45, #Text label color and rotation
# Combine with significance
sig.level = 0.01, insig = "blank",
# hide correlation coefficient on the principal diagonal
diag=FALSE , number.cex = 0.4, tl.cex = 0.5
)
```
Thus, we can see a strong positive correlation between variables such as “totalPrice”, “price” and “communityAverage”. The new variable “avgRoomSize” is mildly correlated with “square,” but is essentially uncorrelated with the variables describing the number of rooms, which is a good sign from a modeling perspective. The variables “distance” and “communityAverage” show a strong negative correlation. We also performed correlation verification on one-hot-encoded variables.
```{r}
#One-hot encoding - correlation matrix of all vars
housing_prices_data_clean <- housing_prices_data_clean[ , !(names(housing_prices_data_clean) %in% drop_colsCorr2)]
# Convert factor variables to dummy variables
dummy_vars <- lapply(housing_prices_data_clean[, sapply(housing_prices_data_clean, is.factor)], function(x) model.matrix(~ x - 1, data = housing_prices_data_clean))
# Combine dummy variables with numeric variables
housing_prices_data_clean_onehot <- cbind(housing_prices_data_clean[, !sapply(housing_prices_data_clean, is.factor)], do.call(cbind, dummy_vars))
# Calculate correlation matrix
correlation_matrix <- cor(housing_prices_data_clean_onehot)
# Visualize correlation matrix as heatmap
ggplot(data = melt(correlation_matrix), aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0, name="Correlation") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "Correlation Plot")
```
The multitude of variables did not make the analysis easy, but interesting conclusions could nevertheless be drawn. Floor height is negatively correlated with certain building types: high floors are not built in “Plate” type properties. In addition, the lack of an elevator is strongly negatively correlated with high floors.
We also prepared a standardized version of the dataset for some models.
```{r}
#restoring tradeTime
housing_prices_data_clean_onehot$tradeTime <- housing_prices_data_clean$tradeTime
## Data for modeling
# Custom function to standardize numeric and integer columns
standardize_cols <- function(x) {
if (is.numeric(x) || is.integer(x)) {
return((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE))
} else {
return(x)
}
}
# Standardize numeric and integer columns
housing_prices_data_clean_stand <- housing_prices_data_clean %>%
mutate(across(where(is.numeric) | where(is.integer), standardize_cols))
#one-hot encoding once again
# Convert factor variables to dummy variables
dummy_vars <- lapply(housing_prices_data_clean_stand[, sapply(housing_prices_data_clean_stand, is.factor)], function(x) model.matrix(~ x - 1, data = housing_prices_data_clean_stand))
# Combine dummy variables with numeric variables
housing_prices_data_clean_stand_onehot <- cbind(housing_prices_data_clean_stand[, !sapply(housing_prices_data_clean_stand, is.factor)], do.call(cbind, dummy_vars))
```
Geolocation with the distribution of Age/Price
```{r}
ggplot(housing_prices_data_clean,aes(x=Lat,y=Lng,group=age))+
geom_point(aes(color=age)) +
scale_colour_gradient(low="#66C2A5", high="#FC8D62")
```
The plot above confirms the original paper’s finding that the oldest houses are mostly located in the city center, while younger homes are in the suburbs.
```{r}
ggplot(housing_prices_data_clean,aes(x=Lat,y=Lng,group=price))+
geom_point(aes(color=price)) +
scale_colour_gradient(low="#66C2A5", high="#FC8D62")
```
A similar situation as before: the most expensive properties can be found in the city center.
```{r}
ggplot(housing_prices_data_clean, aes(x=district, y=price, fill=district)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none") +
scale_fill_brewer(palette="BuPu")
```
From the plot above we can see that the most expensive homes, on average, are in districts 1, 10 and 8.
```{r}
ggplot(housing_prices_data_clean, aes(x=buildingType, y=price, fill=buildingType)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none") +
scale_fill_brewer(palette="BuPu")
```

```{r}
ggplot(housing_prices_data_clean, aes(x=buildingType, y=square, fill=buildingType)) +
geom_boxplot(alpha=0.3) +
theme(legend.position="none") +
scale_fill_brewer(palette="BuPu")
```
The building type does not noticeably influence price or floor area.
Random Forest
Now we reproduce the Random Forest model. The ranger method is used for the Random Forest algorithm, with 900 trees and 10 threads for parallel processing to speed up computation. The model’s importance measure is set to “impurity”. Pre-processing includes centering and scaling the data. Cross-validation with 10 folds is controlled through ctrl_cv10, and the tuning grid fixes the model’s parameters: 20 variables tried at each split (mtry), a variance-based splitting rule, and a minimum node size of 100.
Data Preparation for Modeling and Setting the Evaluation Parameters
```{r}
# Splitting the dataset into train and test
set.seed(123456789)
data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice , # target variable
p = 0.8,
list = FALSE)
data_train <- housing_prices_data_clean[data_which_train,]
data_test <- housing_prices_data_clean[-data_which_train,]
# function for the model evaluation
regressionMetrics <- function(data,
lev = NULL,
model = NULL) {
real <- data$obs
predicted <- data$pred
# Mean Square Error
MSE <- mean((real - predicted)^2)
# Root Mean Square Error
RMSE <- sqrt(MSE)
# Mean Absolute Error
MAE <- mean(abs(real - predicted))
# Mean Absolute Percentage Error
MAPE <- mean(abs(real - predicted)/real)
# Median Absolute Error
MedAE <- median(abs(real - predicted))
# Mean Squared Logarithmic Error
MSLE <- mean((log(1 + real) - log(1 + predicted))^2)
# Root Mean Squared Logarithmic Error (RMSLE)
RMSLE <- sqrt(MSLE)
# R2
R2 <- cor(predicted, real)^2
result <- c(MSE = MSE, RMSE = RMSE, MAE = MAE, MAPE = MAPE, MedAE = MedAE,
MSLE = MSLE, RMSLE = RMSLE, R2 = R2)
return(result)
}
# 10-fold cross validation
ctrl_cv10 <- trainControl(method = "cv",
number = 10,
savePredictions = "final",
summaryFunction = regressionMetrics)
```
Model Parameters
The parameters for the model are defined based on the article.
```{r}
# set.seed(123456789)
# data_rf <-
# caret::train(totalPrice ~ .,
# data = data_train,
# method = "ranger",
# num.trees = 900,
# num.threads = 10,
# importance = "impurity",
# preProcess = c("center", "scale"),
# trControl = ctrl_cv10,
# tuneGrid = expand.grid(mtry = 20,
# splitrule = "variance",
# min.node.size = 100))
# loading from saved file
data_rf <- readRDS("data_rf_model.rds")
data_rf
```
Random Forest
253113 samples
23 predictor
Pre-processing: centered (37), scaled (37)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 227802, 227802, 227800, 227803, 227803, 227802, ...
Resampling results:
MSE RMSE MAE MAPE MedAE MSLE RMSLE
8786246935 93695.62 29103.36 0.2523139 7882.703 0.02567119 0.1590321
R2
0.9971253
Tuning parameter 'mtry' was held constant at a value of 20
Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 100
The paper’s model performs better in terms of RMSLE: its reported RMSLE of 0.12980 is lower than our 0.1590321, suggesting that our model’s predictions are slightly less accurate than the paper’s on the training data.
eXtreme Gradient Boosting
Model Parameters
The parameters for the model are defined based on the article.
```{r}
# set.seed(123456789)
#
# data_xgboost <-
# caret::train(totalPrice ~ .,
# data = data_train,
# method = "xgbTree",
# preProcess = c("center", "scale"),
# trControl = ctrl_cv10,
# tuneGrid = expand.grid(nrounds = 200,
# max_depth = 5,
# eta = 0.1,
# gamma = 0.5,
# colsample_bytree = 0.8,
# min_child_weight = 2,
# subsample = 1))
# loading from saved file
data_xgboost <- readRDS("data_xgboost_model.rds")
data_xgboost
```
eXtreme Gradient Boosting
253113 samples
23 predictor
Pre-processing: centered (37), scaled (37)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 227802, 227802, 227800, 227803, 227803, 227802, ...
Resampling results:
MSE RMSE MAE MAPE MedAE MSLE RMSLE R2
8849962757 93973.28 54423.7 0.0759587 34301.51 NaN NaN 0.9970425
Tuning parameter 'nrounds' was held constant at a value of 200
Tuning parameter 'max_depth' was held constant at a value of 5
Tuning parameter 'eta' was held constant at a value of 0.1
Tuning parameter 'gamma' was held constant at a value of 0.5
Tuning parameter 'colsample_bytree' was held constant at a value of 0.8
Tuning parameter 'min_child_weight' was held constant at a value of 2
Tuning parameter 'subsample' was held constant at a value of 1
The eXtreme Gradient Boosting model shows strong performance with centering and scaling as preprocessing steps. The model was evaluated using 10-fold cross-validation, yielding an MSE of 8,849,962,757 and an RMSE of 93,973.28. The MAE is 54,423.7 and the MAPE is 0.0759587, indicating low percentage errors. The MedAE is 34,301.51, and the R² value of 0.9970425 suggests that the model explains over 99% of the variance in the target variable. However, the MSLE and RMSLE values are NaN, making it difficult to compare directly with the paper’s reported RMSLE of 0.16118.
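The NaN values most likely arise because some cross-validated predictions are negative: log(1 + predicted) is undefined below -1, so the mean becomes NaN. A minimal base-R sketch of a guarded RMSLE (our own illustration; the 0.01 floor is an assumption, not the paper’s method):

```r
# RMSLE becomes NaN as soon as any prediction drops below -1,
# because log1p() is undefined there; flooring predictions avoids this.
rmsle_safe <- function(real, predicted, floor_at = 0.01) {
  predicted <- pmax(predicted, floor_at)  # clip negative predictions
  sqrt(mean((log1p(real) - log1p(predicted))^2))
}

log1p(-5)          # NaN (with a warning) - this is how the metric breaks
rmsle_safe(10, -5) # finite again after clipping
```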
LightGBM Model
In this section, we reproduce the results from the referenced article using the LightGBM model. LightGBM, or Light Gradient Boosting Machine, is a fast and efficient machine learning algorithm that excels at handling large datasets. It is used for both classification and regression tasks. LightGBM works by building an ensemble of decision trees in a sequential manner, where each tree is trained to correct the errors of the previous ones. Unlike other algorithms that grow trees level by level, LightGBM grows trees leaf by leaf, which often results in more accurate models. This approach is guided by gradient descent, which helps minimize errors. LightGBM is designed for speed and scalability, making it suitable for large-scale data. It includes various hyperparameters for fine-tuning, which can significantly improve its performance. This makes LightGBM a popular choice for many machine learning tasks due to its high efficiency and accuracy.
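To make the residual-fitting idea concrete, here is a toy base-R sketch of a gradient-boosting loop for squared error (our own illustration, not LightGBM’s actual leaf-wise implementation): each one-split “stump” is fitted to the residuals of the ensemble built so far, and its contribution is shrunk by the learning rate.

```r
# Toy gradient boosting with one-split stumps (illustrative only)
stump_fit <- function(x, r) {
  # choose the split on x that minimises the SSE of a piecewise-constant fit
  best <- list(sse = Inf)
  for (s in unique(x)) {
    left <- r[x <= s]; right <- r[x > s]
    if (length(left) == 0 || length(right) == 0) next
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best$sse)
      best <- list(sse = sse, split = s, left_mean = mean(left), right_mean = mean(right))
  }
  best
}
stump_predict <- function(st, x) ifelse(x <= st$split, st$left_mean, st$right_mean)

boost <- function(x, y, n_trees = 50, learning_rate = 0.15) {
  pred <- rep(mean(y), length(y))
  for (i in seq_len(n_trees)) {
    st <- stump_fit(x, y - pred)                         # fit the current residuals
    pred <- pred + learning_rate * stump_predict(st, x)  # shrunken update
  }
  pred
}
```

With learning_rate = 0.15, the same value as the article’s LightGBM configuration, the training error shrinks monotonically as trees are added, which is the mechanism the paragraph above describes.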
The parameters used in the model are consistent with those specified in the article, ensuring comparability of results. Specifically, we optimized tree-specific parameters: min_child_weight = 2, num_leaves = 36, colsample_bytree = 0.8, learning_rate = 0.15. Additionally, we set the regularization parameter reg_lambda = 0.40.
Data Preparation for Modeling
We begin with data preparation for modeling such as defining the features and target variable for the model and splitting the dataset into training and testing sets.
```{r}
# Defining features and target
features <- setdiff(names(housing_prices_data_clean), "totalPrice")
target <- "totalPrice"
# Splitting the data into training and testing sets
set.seed(123456789)
data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice ,
p = 0.8,
list = FALSE)
train_data <- housing_prices_data_clean[data_which_train,]
test_data <- housing_prices_data_clean[-data_which_train,]
# Preparing data for LightGBM
trainMatrix <- lgb.Dataset(
data = as.matrix(train_data[, -which(names(train_data) == "totalPrice")]),
label = train_data$totalPrice)
testMatrix <- lgb.Dataset(
data = as.matrix(test_data[, -which(names(test_data) == "totalPrice")]),
label = test_data$totalPrice)
```
Model Parameters
The parameters for the LightGBM model are defined based on the article.
```{r}
# Defining LightGBM parameters based on the article
params <- list(objective = "regression",
metric = "rmse",
min_child_weight = 2,
num_leaves = 36,
colsample_bytree = 0.8,
reg_lambda = 0.40,
learning_rate = 0.15,
feature_fraction = 0.9 )
```
Training the Model
We train the LightGBM model with the specified parameters.
```{r message=FALSE, warning=FALSE, results = "hide"}
# Training the LightGBM model
model_lgbm <- lgb.train(params,
trainMatrix,
nrounds = 1000,
valids = list(test = testMatrix),
early_stopping_rounds = 10)
```
Making Predictions
We generate predictions and clip negative values, which would otherwise break the logarithmic metrics.
```{r}
# Prediction
test_predictions <- predict(model_lgbm, as.matrix(test_data[, -which(names(test_data) == "totalPrice")]))
train_predictions <- predict(model_lgbm, as.matrix(train_data[, -which(names(train_data) == "totalPrice")]))
# Excluding negative predicted values
test_predictions <- ifelse(test_predictions < 0, 0.01, test_predictions)
train_predictions <- ifelse(train_predictions < 0, 0.01, train_predictions)
```
Evaluation Metrics
We evaluate the model using RMSLE and MAPE metrics.
```{r}
# Evaluation metrics
# RMSLE and MAPE for test data
test_rmsle_score <- rmsle(test_data[[target]], test_predictions)
test_mape_score <- mape(test_data[[target]], test_predictions)
# RMSLE and MAPE for train data
train_rmsle_score <- rmsle(train_data[[target]], train_predictions)
train_mape_score <- mape(train_data[[target]], train_predictions)
cat("Test RMSLE: ", test_rmsle_score, "\n")
cat("Test MAPE: ", test_mape_score, "\n")
cat("Train RMSLE: ", train_rmsle_score, "\n")
cat("Train MAPE: ", train_mape_score, "\n")
```
Test RMSLE: 0.1501456
Test MAPE: 0.17798
Train RMSLE: 0.1391439
Train MAPE: 0.1163801
Plotting Results
Finally, we plot the actual vs. predicted values for both training and testing data.
```{r}
# Dataframe for plotting
plot_test <- data.frame(
Real = test_data[[target]],
Predicted = test_predictions )
plot_train <- data.frame(
Real = train_data[[target]],
Predicted = train_predictions)
# Plot
test_p <- ggplot(plot_test, aes(x = Predicted, y = Real)) +
geom_point(alpha = 0.6, color = "lightblue") +
geom_abline(slope = 1, intercept = 0, color = "darkorange", lwd = 0.7) +
labs(
x = "Light GBM Predictions",
y = "Actual Price" ,
subtitle = "Test Data"
) +
scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(plot.title = element_blank())
train_p <- ggplot(plot_train, aes(x = Predicted, y = Real)) +
geom_point(alpha = 0.6, color = "lightblue") +
geom_abline(slope = 1, intercept = 0, color = "darkorange", lwd = 0.7) +
labs(
x = "Light GBM Predictions",
y = "Actual Price" ,
subtitle = "Train Data"
) +
scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(plot.title = element_blank())
grid.arrange(train_p, test_p, ncol = 2)
```Our test RMSLE of 0.1501 is slightly better than the article’s test RMSLE of 0.1694.
Our train RMSLE of 0.1391 is also better than the article’s train RMSLE of 0.1669.
We also included MAPE as an additional metric, providing further insight into the model’s accuracy. Our test MAPE is 0.178, and train MAPE is 0.1164.
The number of estimators in our model was 387, significantly higher than the 64 estimators used in the article. This difference indicates that our model required more trees to achieve optimal performance, which might be due to differences in data preprocessing or parameter tuning.
In conclusion, our LightGBM model performs comparably to the one in the referenced article, with a slightly better fit on the training data. The inclusion of MAPE gives us a clearer picture of prediction errors, emphasizing the model’s practical accuracy in terms of percentage error.
Hybrid Regression
In this section, we describe the creation and evaluation of a hybrid regression model that combines the predictions from three different machine learning models: RandomForest, XGBoost, and LightGBM. Each model contributes equally to the final prediction, enhancing the robustness and accuracy of the prediction by leveraging the strengths of each individual model.
Loading Models and Generating Predictions
We begin by loading the pre-trained RandomForest and XGBoost models and re-creating the LightGBM model. After loading these models, we generate predictions for the train and test datasets.
```{r message=FALSE, warning=FALSE, results = "hide"}
# load the pre-trained RandomForest and XGBoost models
data_rf <- readRDS("data_rf_model.rds")
data_xgboost <- readRDS("data_xgboost_model.rds")
# sourcing this script trains LightGBM and creates the model_lgbm object
source("RR_HousingPricePred_LightGBM.R")
load("housing_prices_data_clean.Rdata")
# recreate the same 80/20 train/test split used for the individual models
set.seed(123456789)
data_which_train <- createDataPartition(housing_prices_data_clean$totalPrice,
                                        p = 0.8,
                                        list = FALSE)
data_train <- housing_prices_data_clean[data_which_train, ]
data_test <- housing_prices_data_clean[-data_which_train, ]
pred_rf_test <- predict(data_rf, data_test)
pred_xgb_test <- predict(data_xgboost, data_test)
pred_lgb_test <- predict(model_lgbm, as.matrix(data_test[, -which(names(data_test) == "totalPrice")]))
pred_rf_train <- predict(data_rf, data_train)
pred_xgb_train <- predict(data_xgboost, data_train)
pred_lgb_train <- predict(model_lgbm,
as.matrix(data_train[, -which(names(data_train) == "totalPrice")]))
```
Combining Predictions
The predictions from the three models are combined using a simple average. This approach ensures that each model’s prediction contributes equally to the final prediction.
```{r}
combined_predictions_test <- (pred_rf_test + pred_xgb_test + pred_lgb_test) / 3
combined_predictions_train <- (pred_rf_train + pred_xgb_train + pred_lgb_train) / 3
# clamp negative averaged predictions to a small positive value,
# so that the logarithm inside RMSLE is defined
combined_predictions_test <- ifelse(combined_predictions_test < 0,
                                    0.01, combined_predictions_test)
combined_predictions_train <- ifelse(combined_predictions_train < 0,
                                     0.01, combined_predictions_train)
```
Evaluating the Hybrid Model
We evaluate the performance of the hybrid model using two key metrics: Root Mean Squared Logarithmic Error (RMSLE) and Mean Absolute Percentage Error (MAPE). These metrics provide a clear picture of the model’s accuracy in predicting housing prices.
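For reference, with $y_i$ the actual price, $\hat{y}_i$ the prediction, and $n$ the number of observations, the two metrics (as computed by the `Metrics` package) are:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(\hat{y}_i + 1) - \log(y_i + 1)\bigr)^2}, \qquad \mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

RMSLE penalizes relative rather than absolute errors, which suits right-skewed price data, while MAPE reads directly as an average percentage error.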
```{r}
combined_rmsle_test <- Metrics::rmsle(data_test$totalPrice,
combined_predictions_test)
combined_mape_test <- Metrics::mape(data_test$totalPrice,
combined_predictions_test)
cat("Combined Test RMSLE: ", combined_rmsle_test, "\n")
cat("Combined Test MAPE: ", combined_mape_test, "\n")
combined_rmsle_train <- Metrics::rmsle(data_train$totalPrice,
                                       combined_predictions_train)
combined_mape_train <- Metrics::mape(data_train$totalPrice,
                                     combined_predictions_train)
cat("Combined Train RMSLE: ", combined_rmsle_train, "\n")
cat("Combined Train MAPE: ", combined_mape_train, "\n")
```
Combined Test RMSLE: 0.1434729
Combined Test MAPE: 0.1566243
Combined Train RMSLE: 0.1370518
Combined Train MAPE: 0.1172748
We compare the results of our hybrid model with those reported in the referenced article. The article reported a Train RMSLE of 0.14969 and a Test RMSLE of 0.16372.
Our hybrid model achieved:
Train RMSLE: 0.1371
Test RMSLE: 0.1435
These results indicate that our model performs better than the article’s model on both the training and test data.
Plotting the Results
Finally, we visualize the performance of the hybrid model by plotting the actual vs. predicted prices. The plot includes a regression line to illustrate the relationship between predicted and actual prices.
```{r}
tibble(
pred = combined_predictions_test,
actual = data_test$totalPrice
) %>%
ggplot(aes(pred, actual)) +
geom_point(alpha = 0.6, color = "lightblue") +
geom_abline(slope = 1, intercept = 0, color = "darkorange", lwd = 0.7) +
labs(subtitle = "Test Data", x = "Hybrid Model Prediction", y = "Actual Price") +
scale_x_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
theme_minimal() +
theme(plot.title = element_blank())
```
In summary, the hybrid regression model combines predictions from RandomForest, XGBoost, and LightGBM to improve the overall prediction accuracy. By averaging the predictions, we leverage the strengths of each model, resulting in a robust and reliable prediction system. Our hybrid model outperforms the referenced model, achieving lower RMSLE on both training and test datasets.
Stacked Generalization
Stacked generalization combines the outputs of individual estimators and uses a regressor to produce the final prediction. The technique leverages the strengths of each base estimator by feeding their outputs as inputs to a final estimator.
The core idea is to use the predictions of the preceding models as features for another model. The approach also incorporates k-fold cross-validation to mitigate the risk of overfitting.
This study employed a common 2-level stacking architecture to predict housing prices. At the first level, Random Forest and LightGBM were used, while XGBoost formed the second level. To accommodate the relatively large dataset, the stacking model was fitted with 5-fold cross-validation.
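Since the stacking code itself is not shown here, the scheme can be sketched as follows. This is a minimal illustration, not our tuned pipeline: it assumes the `data_train` split from the previous section and uses placeholder hyperparameters.

```{r eval=FALSE}
# minimal 2-level stacking sketch (illustrative hyperparameters only)
library(caret); library(randomForest); library(lightgbm); library(xgboost)

X <- as.matrix(data_train[, setdiff(names(data_train), "totalPrice")])
y <- data_train$totalPrice
folds <- createFolds(y, k = 5)

# level 1: collect out-of-fold predictions, so the level-2 model never sees
# predictions made on data the base models were trained on
oof <- matrix(NA_real_, nrow = length(y), ncol = 2,
              dimnames = list(NULL, c("rf", "lgb")))
for (idx in folds) {
  rf_fit  <- randomForest(x = X[-idx, ], y = y[-idx], ntree = 100)
  lgb_fit <- lightgbm(data = X[-idx, ], label = y[-idx],
                      params = list(objective = "regression"),
                      nrounds = 100, verbose = -1)
  oof[idx, "rf"]  <- predict(rf_fit, X[idx, ])
  oof[idx, "lgb"] <- predict(lgb_fit, X[idx, ])
}

# level 2: XGBoost trained on the base models' out-of-fold predictions
meta <- xgboost(data = oof, label = y, nrounds = 100, verbose = 0)
```

At prediction time, the fitted level-1 models score the test set first, and their predictions are passed to `meta` to produce the final estimate.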
Our Stacked Generalization model achieved:
Train RMSLE: 0.294
Test RMSLE: 0.304
These results indicate that our model performs worse than the article’s model on both the training and test data.
Conclusions
We obtained the expected results, and in the case of Hybrid Regression and LightGBM even better ones than the authors of the article.
LightGBM and Hybrid Regression gave us better results than the researchers’.
For RF, XGBoost, and Stacked Generalization we obtained worse results.
The training set exhibits lower error rates than the test set, suggesting a degree of overfitting.